
[v22.3.x] archival: remove timeouts from archival_metadata_stm #12690

Merged

Conversation

andrwng
Contributor

@andrwng andrwng commented Aug 9, 2023

Backport of #9633
Fixes #10777

CONFLICTS:

  • cloud stress test doesn't exist in this branch
  • reimplemented the broken semaphore fix, git wasn't able to apply cleanly

At a high level, the archival loop is implemented with the following steps:

```
while not aborted:
  wait until node is leader
  sync that offsets from previous terms are applied
  while node is leader:
    upload segments based on what is in the manifest
    replicate changes to manifest via archival stm
    wait for archival stm update to be applied (updates manifest)
```

The archival loop currently expects that updates to the manifest are driven either by it or by the housekeeping service (ignored here for the sake of example). One case where this invariant breaks is when the replication/apply of archival batches times out: the actual apply of the op may still proceed as part of the raft::state_machine background fiber while the archival loop moves on without waiting, so the updates to the manifest may race with a subsequent iteration of the archival loop.

This commit fixes the race by removing the timeout from the archival_metadata_stm::do_replicate_commands() call, forcing the caller to wait indefinitely for the op to complete. If the wait fails and the persisted_stm is still leader, it explicitly steps down to force the archival loop to re-sync in a new term.
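
For illustration, here is a minimal sketch of that pattern using hypothetical stand-in types (archival_stm_sketch, errc, and the *_sketch helpers are not the actual Redpanda API): replicate the command batch, wait for the apply with no deadline, and step down if the wait still fails while we are leader.

```
// Illustrative sketch only: hypothetical stand-in types, not the actual
// Redpanda code. It shows the shape of the change: replicate the archival
// command batch, wait for the apply with no deadline, and step down from
// leadership if the wait still fails while we are leader.
enum class errc { success, replication_error };

struct replicate_result {
    bool ok;
    long last_offset;
};

struct archival_stm_sketch {
    replicate_result replicate(/* command batch */) { return {true, 42}; }
    // Wait until the background apply fiber has applied `offset`. There is no
    // timeout parameter any more, so the caller blocks until apply (or abort).
    bool wait_applied(long offset) { (void)offset; return true; }
    bool is_leader() const { return true; }
    void step_down() { /* force the archival loop to re-sync in a new term */ }
};

errc do_replicate_commands_sketch(archival_stm_sketch& stm) {
    auto res = stm.replicate();
    if (!res.ok) {
        return errc::replication_error;
    }
    // Previously this wait carried a timeout; on expiry the apply could still
    // race ahead in the background fiber. Now the wait is indefinite.
    if (!stm.wait_applied(res.last_offset)) {
        if (stm.is_leader()) {
            stm.step_down();
        }
        return errc::replication_error;
    }
    return errc::success;
}
```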

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Bug Fixes

  • Fixes an issue with the archival upload path that could contribute to data loss when archival metadata updates took a long time to be replicated and applied.

jcsp and others added 4 commits August 9, 2023 18:39
The underlying wait already takes one, this is just
a pass-through.

(cherry picked from commit 7be3f80)
There's an upcoming change to remove the timeout from the
archival_metadata_stm. That change is much simpler if
persisted_stm::wait_no_throw() takes a timeout (deadline) instead of a
duration, since the underlying raft::state_machine::wait() call uses the
model::no_timeout deadline value to determine whether to set a timer.

Along the way I added an optional abort source reference that is
currently unused.

(cherry picked from commit cac4ac9)
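
As a rough sketch of why the deadline-based signature helps (hypothetical helper names, not the actual persisted_stm/raft API): a deadline with a "no timeout" sentinel lets the callee decide whether to arm a timer at all, whereas a duration forces it to pick one.

```
// Minimal sketch, not the actual persisted_stm/raft code: a deadline
// (time_point) parameter with a "no timeout" sentinel lets the callee decide
// whether to arm a timer at all.
#include <chrono>

using sketch_clock = std::chrono::steady_clock;
inline constexpr auto no_timeout_sketch = sketch_clock::time_point::max();

// Hypothetical wait helper: takes a deadline instead of a duration.
bool wait_no_throw_sketch(long offset, sketch_clock::time_point deadline) {
    (void)offset;
    if (deadline == no_timeout_sketch) {
        // Sentinel deadline: skip the timer and block until applied/aborted.
        return true;
    }
    // Otherwise arm a timer that fires at `deadline`.
    return sketch_clock::now() < deadline;
}

// Callers that previously passed a duration convert once at the call site:
//   wait_no_throw_sketch(offset, sketch_clock::now() + some_duration);
```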
At a high level, the archival loop is implemented with the following
steps:

```
while not aborted:
  wait until node is leader
  sync that offsets from previous terms are applied
  while node is leader:
    upload segments based on what is in the manifest
    replicate changes to manifest via archival stm
    wait for archival stm update to be applied (updates manifest)
```

The archival loop currently expects that updates to the manifest are
driven either by it or by the housekeeping service (ignored here for
the sake of example).

One case where this invariant breaks is when the replication/apply of
archival batches times out: the actual apply of the op may still proceed
as part of the raft::state_machine background fiber while the archival
loop moves on without waiting, so the updates to the manifest may race
with a subsequent iteration of the archival loop.

This commit fixes the race by removing the timeout from the
archival_metadata_stm::do_replicate_commands() call, forcing the caller
to wait indefinitely for the op to complete.

(cherry picked from commit 0dd1dc8)
CONFLICT:
- ntp_archiver_service has changed a bunch since 22.3, but the effective
  change is the same here: catch broken semaphore exceptions

Replication may throw a ss::broken_named_semaphore when it encounters a
shutdown error:

```
DEBUG 2023-03-24 22:54:01,854 [shard 0] archival - [fiber4 kafka/panda-topic/7] - ntp_archiver_service.cc:494 - Updating the archival_meta_stm in read-replica mode, in-sync offset: -9223372036854775808, last uploaded offset: 0, last compacted offset: -9223372036854775808
DEBUG 2023-03-24 22:54:01,854 [shard 0] cluster - ntp: {kafka/panda-topic/7} - archival_metadata_stm.cc:188 - command_batch_builder::replicate called
WARN  2023-03-24 22:54:01,854 [shard 0] cluster - ntp: {kafka/panda-topic/7} - archival_metadata_stm.cc:333 - error on replicating remote segment metadata: raft::errc:19
ERROR 2023-03-24 22:54:01,854 [shard 0] archival - [fiber4 kafka/panda-topic/7] - ntp_archiver_service.cc:252 - upload loop error: seastar::broken_named_semaphore (Semaphore broken: mutex)
```

This commit handles the exception the same way we handle other shutdown exceptions:
by doing nothing.

(cherry picked from commit cdb2b39)
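
For illustration, a minimal sketch of that handling with hypothetical exception types (not the actual Seastar/Redpanda code, which catches seastar::broken_named_semaphore in the ntp_archiver upload loop): on shutdown, swallow the error and let the loop exit instead of logging it at ERROR level.

```
// Illustrative sketch with hypothetical exception types.
#include <exception>
#include <iostream>

struct broken_named_semaphore_sketch : std::exception {};
struct gate_closed_sketch : std::exception {};

void upload_iteration_sketch() { throw broken_named_semaphore_sketch{}; }

void upload_loop_sketch() {
    try {
        upload_iteration_sketch();
    } catch (const broken_named_semaphore_sketch&) {
        // Shutdown in progress: treated like other shutdown exceptions,
        // i.e. do nothing and fall out of the loop.
    } catch (const gate_closed_sketch&) {
        // Same handling for other shutdown-time errors.
    } catch (const std::exception& e) {
        std::cerr << "upload loop error: " << e.what() << "\n";
    }
}
```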
@piyushredpanda
Contributor

Only failure seems to be #10140

@piyushredpanda piyushredpanda merged commit 609c6be into redpanda-data:v22.3.x Aug 10, 2023
8 checks passed
@BenPope BenPope added this to the v22.3.23 milestone Aug 17, 2023
@BenPope BenPope added the kind/backport PRs targeting a stable branch label Aug 17, 2023
Labels: area/redpanda, kind/backport